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Abstract 

In this paper we consider the Nonnegative Matrix Factorization (NMF) 
problem: given an (elementwise) nonnegative matrix V £ R™^" find, for 
assigned k, nonnegative matrices W G R™^*" and H G R*^" such that 
V = WH. Exact, non trivial, nonnegative factorizations do not always 
exist, hence it is interesting to pose the approximate NMF problem. The 
criterion which is commonly employed is I-divergence between nonnega- 
tive matrices. The problem becomes that of finding, for assigned k, the 
factorization WH closest to V in I-divergence. An iterative algorithm, 
EM like, for the construction of the best pair {W, H) has been proposed 
in the literature. In this paper we interpret the algorithm as an alternat- 
ing minimization procedure a la Csiszar-Tusnady and investigate some of 
its stability properties. NMF is widespreading as a data analysis method 
in applications for which the positivity constraint is relevant. There are 
other data analysis methods which impose some form of nonnegativity: 
we discuss here the connections between NMF and Archetypal Analysis. 
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1 Introduction 



The approximate Nonnegative Matrix Factorization (NMF) of nonnegative ma- 
trices is a data analysis technique only recently introduced El ■ Roughly 
speaking the problem is to find, for a given nonnegative matrix V € M™^", 
and an assigned k, a pair of nonnegative matrices W G M™^'' and H G K.'j^^" 
such that, in an appropriate sense, V « WH. In |5] EM like algorithms for the 
construction of a factorization have been proposed. The algorithms have been 
later derived in ^U] by using an ad-hoc auxiliary function, a common approach 
in deriving EM algorithms. In E] the connection with the classic alternating 
minimization of the I-divergence [2] has been pointed out but not fully inves- 
tigated. In this paper we pose the NMF problem as a minimum I-divergence 
problem that can be solved by alternating minimization and derive, from this 
point of view, the algorithm proposed in |9J. There are alternative approaches 
to approximate nonnegative matrix factorization. For instance, recently, see [3], 
results have been obtained for the approximate factorization (w.r.t. the Frobe- 
nius norm) of symmetric nonnegative matrices. 

Although only recently introduced the NMF has found many applications 
as a data reduction procedure and has been advocated as an alternative to 
Principal Components Analysis (PCA) in cases where the positivity constraint 
is relevant (typically image analysis). The title of E) is a clear indication of this 
point of view, but a complete analysis of the relations between NMF and PCA 
is still lacking. Our interest in NMF stems from the system theoretic problem of 
approximate realization (or order reduction) of Hidden Markov Models. Partial 
results have already been obtained 'H!. 

This paper is organized as follows. In sectionElwe pose the approximate non- 
negative matrix factorization problem, define the I-divergence between matrices 
and discuss the solution proposed in [HlEl- In sectional we pave the way for the 
alternating minimization algorithm presenting the properly lifted version of the 
minimization problem and solving the two partial minimizations in the style of 
Csiszar and Tusnady T . In section^jwe construct the alternating minimization 
algorithm and compute the iteration gain. One of the advantages of working 
with the lifted problem is that it sheds a new light also on the derivation of 
the algorithm via auxiliary functions given in '10'. In section [S] we will use the 
results of section |31 to construct a very natural auxiliary function to solve the 
original problem. A discussion of the convergence properties of the algorithm 
is given in sectional In the concluding section [3 we establish a connection be- 
tween the approximate NMF problem and the Archetypal Analysis algorithm 
of Cutler and Breiman The present paper is an extended version of 21- 

2 Preliminaries and problem statement 

The NMF is a long standing problem in linear algebra |H1 E| ■ It can be stated 
as follows. Given V G M™^", and 1 < fc < min{m, n}, find a pair of matrices 
W G M"'"' and H G Ml""" such that V = WH. The smallest k for which a 
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factorization exists is called the positive rank of V, denoted prank(F). This 
definition implies that rank(V^) < prank(y) < min{m, n}. It is well known that 
prank(y) can assume all intermediate values, depending on V. Examples for 
which nonnegative factorizations do not exist, and examples for which factor- 
ization is possible only for k > rank(T/) have been constructed in the literature 
The prank has been characterized only for special classes of matrices [T^ 
and algorithms for the construction of a NMF of a general positive matrix are 
not known. 

The approximate NMF has been recently introduced in j^j independently 
from the exact NMF problem. The set-up is the same, but instead of exact 
factorization it is required that V ~ WH in an appropriate sense. In and 
in this paper, the approximation is to be understood in the sense of minimum 
I-divergence. For two nonnegative numbers p and q the I-divergence is defined 
as 

p 

D{p\\q) =plog- ~p + q, 

with the conventions 0/0 = 0, OlogO — and p/Q — oo for p > 0. From the 
inequality xlogx > a; — 1 it follows that > with equality iS p = q. 

For two nonnegative matrices M = (Mij) and N — {Nij), of the same size, the 
I-divergence is defined as 



D{M\\N) = J2DiM,j\\N,j). 



Again it follows that D{M\\N) > with equality iff M = A^. For nonnegative 
vectors or tensors of the same size a similar definition applies. 
The problem of approximate NMF is to find for given V and a fixed number k 
(often referred to as the inner size of the factorization) 

aTgmiiiD(V\\WH). (1) 

The function D : {W,H) D{V\\WH) will sometimes be referred to as the 
objective function. The domain of D is the set of pairs {W, H) with nonnegative 
entries. The interior of the domain is the subset of pairs {W, H) with positive 
(> 0) entries, whereas pairs on the boundary have at least one entry equal to 
zero. 

Although the objective function {W,H) ^ D{V\\WH) is easily seen to be 
convex in W and H separately, it is not jointly convex in the two variables. 
Hence {W, H) ^ D{y\\W H) may have several (local) minima and saddle points, 
that may prevent numerical minimization algorithms to converge to the global 
minimizer. However D{V\\W H) cannot have a local maximum in an interior 
point (Wo,ifo)j because then also W i— > D{y\\WHo) would have a local max- 
imum in Wo, which contradicts convexity. Local maxima at the boundary are 
not a priori excluded. 

It is not immediately obvious that the approximate NMF problem admits a 
solution. The following result is therefore relevant. 
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Proposition 2.1 The minimization problem Q) has a solution. 
The proof of this proposition is deferred to section^ 

Notice that, increasing the inner size from fc to fc + 1, the optimal value of the 
objective function decreases. This follows from the fact that one can trivially 
embed the factorization problem with inner size k into the problem with inner 
size fc + 1 simply adding a zero last column to the optimal W and an arbitrary 
last row to the optimal H of the problem with inner size fc. Unfortunately, 
unlike the SVD of a matrix, the best approximations with increasing fc are not 
embedded one into another. For increasing fc the computations are to be carried 
out anew. 

Although, according to proposition l2.1l a solution to the minimization prob- 
lem exists, it will certainly not be unique. In order to rule out too many triv- 
ial multiple solutions, we impose the condition that H is row stochastic, so 
'Y^- Hij = 1 for all I. This is not a restriction. Indeed, first we exclude without 
loss of generality the case where H has one or more zero rows, since we would 
then in fact try to minimize the I-divergence with inner size smaller than fc. Let 
h be the diagonal matrix with elements hi — Hij , then WH — WH with 

W = Wh, H — h^^H and H is by construction row stochastic. The convention 
that H is row stochastic still does not rule out non- uniqueness. Think e.g. of 
post-multiplying W with a permutation matrix 11 and pre-multiplying H with 

n-i. 

Let e„ (e^) be the column (row) vector of size n whose elements are all equal 
to one. Given fc, the (constrained) problem we will look at from now on is 

min D{V\\WH). (2) 

W.H-.He^ =efc 

For the sake of brevity we will often write e for a vector of I's of generic size. 
The constraint in the previous problem will then read as He = e. 

To carry out the minimization numerically, Lee and Seung I10| proposed 
the following iterative algorithm. Denoting by and the matrices at step 
the update equations are 

^r^-^^E(^ (3) 

The initial condition (iy°,i7°) will always be assumed to be in the interior 
of the domain. Only a partial justification for this algorithm is given in |1(J| . 
although the update steps Q and are like those in the EM algorithm, known 
from statistics, see 0. Likewise the convergence properties of the algorithm are 
unclear. In the next section the minimization problem will be cast in a different 
way to provide more insight in the specific form of the update equations and on 
the convergence properties of the algorithm. 
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We will now show that the V matrix in the approximate NMF problem can 
always be taken as a probability matrix P i.e. such that Pij > 0, Pij — 1- 
This will pave the way for the probabilistic interpretation of the exact and 
approximate NMF problems to be given later. 

Let P = jTy^V, (5_ = -StW^' ^ ^ e'^We and Q+ = H. Notice that 
Pe — e^Q-e = 1 and Q+e = e. Using the definition of divergence and 
elementary computations, we obtain the decomposition 

D(y\\WH) = e'^VeD{P\\Q_Q+) + D{e'^Ve\\w). 

Hence, since the number e^Ve is known, minimizing D{V\\WH) w.r.t. {W,H) 
is equivalent to minimizing D{P\ \Q-Q+) w.r.t. (Q-, Q+) and D{e^Ve\\w) w.r.t. 
w. The minimizers of the three problems satisfy the relations W* = e^VeQ*_^ 
H* = Q'^, and w* ~ e^Ve. Minimizing D{V\\WH) is therefore equivalent to 
minimizing D{P\\Q^Q+). This enables us to give the problem a probabilistic 
interpretation. Indeed, 

Z?(P||0_g+) =^i^(P.,||(Q_Q+).,) =^P,,log— , (5) 

which is the usual I-divergence (Kullback-Leibler distance) between (finite) 
probability measures. This will be exploited in later sections. From now on 
we will always consider the following problem. Given the probability matrix P 
and the integer k find 

min D{P\\Q_Q+). 

Q- ,Q+:Q+e=e 

For typographical reasons we often, but not always, denote the entries of P by 
P{ij) instead of and likewise for other matrices. 

The minimization algorithm is easily seen to be invariant under the previous 
normalizations. Let Q*^_ = ^-r^t^ E^nd — . Substitute the definitions of 
(P, Q*_,Q\_) into and Q and use the easily verified fact that e^W*e = e^Ve 
for <: > 1 to obtain the update equations in the new notations 
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3 Lifted version of the problem 

In this section we lift the I-divergence minimization problem to an equivalent 
minimization problem where the 'matrices' (we should speak of tensors) have 
three indices. 
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3.1 Setup 

Let be given a probability matrix P (i.e. P{ij) > 0, J^ij -f(u) = 1) f-i^d 
integer k < min{r7i, ti}. We introduce the following sets 



T' = <^ P e 



]>m X A; X 



Q+ > 0, Q+e = e, e^Q-e = l}, 



|q e M";'^" : Q(ij) = ^ Q{tlj) for some Q G q| 



The interpretation of the sets V, Q, Q is given next. 

Suppose one is given random variables (Yl , X, Kf), taking values in {1, . . . , m} x 
{1, . . . , A:} X {1, . . . , n}. For convenience we can think of the r.v.'s as defined on 
the canonical measurable space {fl,J-), where O is the set of all triples and 
T is For lu = {i, I, j) we have the identity mapping (F_ , X, y+)(w) = (z, Z, j). 
If M a given probability measure on this space, then the distribution of the triple 
X, Y^) under M is given by the tensor R defined by 

- iR(y_ = X = /, y+ - j)- (8) 

Conversely, a given tensor R defines a probability measure R on We 
will use the notation D both for I-divergence between tensors and matrices 
and for the KuUback-Leibler divergence between probabilities. If P, Q are 
tensors related to probability measures P and Q like in (jSJ we obviously have 

D{vm = D{¥\m. 

The sets Q correspond to subsets of the set of all measures on [Vl^T). In 
particular P corresponds to the subset of all measures whose Y = (Y_,y_|_) 
marginal coincides with the given P, while Q corresponds to the subset of mea- 
sures under which Y^ and Y^ are conditionally independent given X. The first 
assertion is evident by the definition of T'. To prove the second assertion notice 
that if Q(r_ = i^X = l,Y+ = j) = Q(iZj) = Q-{il)Q+{lj), then summing 
over j one gets Q{Y- = i,X = I) = Q-{il) (since Q+e — e) and similarly 
= j\X = = Q+{lj)- It follows that Q{Y- ^ i,X ^ l,Y+ = j) = 
= i,X = Z)Q(y+ = j\X = I) which is equivalent to 

Q(r_ =i,Y+= j\X = l) = Q(y_ =t\X = l)Q{Y+ - j|X = 

i.e. y_,l+ are conditionally independent given X. 

Finally the set Q is best interpreted algebraically as the set of to x ?! probability 
matrices that admit exact NMF of size k. 

The following observation (taken from motivates our approach. 
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Lemma 3.1 P admits exact factorization of inner size k iffP Cl Q. =^ . 



Proof. If 7-* n Q 7^ then there exists a matrix Q G Q which also belongs to 
therefore P = Q-Q+. Conversely, if we have P = Q-Q+ with inner size 
fc, then the tensor P given by P(i^j) = Q^{il)Q+{lj) clearly belongs to V. As 
in section [5] we can w.l.o.g. assume that Q+e — e, so that P belongs to Q as 
well. □ 

We are now ready to give a natural probabilistic interpretation to the exact 
NMF problem. The probability matrix P admits exact NMF P = Q_Q+ iff 
there exists at least one measure on whose Y — (F_,y+) marginal is P 

and at the same time making and Y-^ conditionally independent given X . 

Having shown that the exact NMF factorization P = Q-Q+ is equivalent to 
n Q 7^ it is not surprising that the approximate NMF, corresponding to 
P n Q = 0, can be viewed as a double minimization over the sets V and Q. 

Proposition 3.2 Let P he given. The function (P, Q) ^ Z3(P||Q) attains a 
minimum on T x Q and it holds that 

mm D{P\\Q)= min i?(P||Q). 
Q^Q PeV.QeQ 

The proof will be given in subsection 13. 21 

Remark 3.3 Let P* and Q* be the minimizing elements in proposition 13. 21 If 
there is Iq such that J2ij P*(*^oj) = 0, then all Q*{iloj) are zero as well. Simi- 
larly, if there is Iq such that J^ij Q*(*^oj) = 0; then all F*{iloj) are zero as well. 
In each (and hence both) of these cases the optimal approximate factorization 
Q-Q"^ of P is of inner size less than k (delete the column corresponding to 
from Ql and the corresponding row of Q*^). 

3.2 Two partial minimization problems 

In the next section we will construct the algorithm for the solution of the double 
minimization problem 

min ^i?(P||Q), 

of proposition l3.2l as an alternating minimization algorithm over the two sets "P 
and Q. This motivates us to consider here two partial minimization problems. 
In the first one, given Q G Q we minimize the I-divergence Z?(P| |Q) over P e P. 
In the second problem, given P G 7-' we minimize the I-divergence £)(P| |Q) over 
Qg Q. 

Let us start with the first problem. The unique solution P* = P*(Q) can 
easily be computed analytically and is given by 
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where Q{ij) = X]/ Q(*0)- We also adopt the convention to put P*{ilj) = if 
Q{ij) = 0, which ensures that, viewed as measures, P* ^ Q. 

Now we turn to the second partial minimization problem. The unique solu- 
tion Q* = Q*(P) to this problem can also be easily computed analytically and 
is given by 

Qiiii) = Y.^m (10) 

3 

where we assign arbitrary values to the Q\{lj) (complying with the constraint 
Q+e — e) for those I with 'Yliij P(*^j) = 0. 

The two partial minimization problems and their solutions have a nice proba- 
bilistic interpretation. 

In the first minimization problem, one is given a distribution Q, which makes 
the pair Y = (Y^^Y^) conditionally independent given X, and finds the best 
approximation to it in the set "P of distributions with the marginal of Y given 
by P. Let P* denote the optimal distribution of {Y^^X,Y^). Equation Q can 
then be interpreted, in terms of the corresponding measures, as 

P*(r_ = = - j) = (Q(X = l\Y^ =i,Y+^ ])P{ij)- 

Notice that the conditional distributions of X given Y under P* and Q are the 
same. We will see below that this is not a coincidence. 

In the second minimization problem, one is given a distribution P, with the 
marginal of Y given by P and finds the best approximation to it in the set Q 
of distributions which make Y — (l"_,y+) conditionally independent given X. 
Let Q* denote the optimal distribution of (F_, X, Yj^). Equations (|1U|I and Hll|l 
can then be interpreted, in terms of the corresponding measures, as 

Q*(r_ ^i,x ^i)^ p(y_ = i,x ^i) 

and 

Q*{Y+=j\X = l)=F{Y+=j\X = l). 

We see that the optimal solution Q* is such that the marginal distributions of 
{X, Y-) under P and Q* coincide as well as the conditional distributions of 1+ 
given X under P and Q*. Again, this is not a coincidence, as we will explain 
below. 

Remark 3.4 As a side remark we notice that the minimization of D(Q||P) 
over P e P for a given Q G Q yields the same solution P*. A similar result 
does not hold for the second minimization problem. This remark is not relevant 
for what follows. 
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We can now state the so called Pythagorean rules for the two partial minimiza- 
tion problems. This terminology was introduced by Csiszar j^. 

Lemma 3.5 For fixed Q and P* — P*(Q) it holds that, for any P G P, 

i:>(P||Q) = i:i(P||p*) + i:i(p*||Q), (12) 

moreover 

D{P*\\Q) = D{P\\Q), (13) 

where 

I 

For fixed P and Q* = Q*(P) it holds that, for any Q £ Q, 

i^(P||Q) = 7^(P||Q*)+iP(Q*||Q). (15) 
Proof. To prove the first rule we compute 

i?(P||P*) + L>(P*||Q) 

^E^(.)i;iog§|^.(P„Q). 



The first rule follows. To prove the relation p3|) insert equation @ into 
£>(P*||Q) and sum over I to get 

ilj 

To prove the second rule we first introduce some notation. Let P(^^) = J2j 
P(-O') = EiP(*0') and P{j\l) = P(-0)/Ej P(-0)- For Q we use similar no- 
tation and observe that Q(iZ-) = Q^{il), and Q{j\l) ~ Q+{lj)/ J2j *3+(0): and. 
Q*_{il) = P{il-) and Q+{lj) = P(j|0- We now compute 

i^(PllQ) - i?(P||Q*) = J2 Pi'lj) fl°g ^T7^ + l°g ^HR 

= i^(Q*IIQ). 
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The second rule follows. □ 

With the aid of the relation H13|) we can now prove proposition 13. 21 

Proof of proposition 13.^ With P* — P*(Q), the optimal solution of the 
partial minimization over 7-', we have 

i:i(P||Q) > i:'(P*||Q) 

= DiP\\Q) 

> minD(PllQ). 

It follows that infpgp QggD(P||Q) > mingeQ D(P||Q). 

Conversely, let Q in Q be given and let Q be defined by Q{ij) — J2i Q(*0) • 

From 

D{P\\Q)=D{P*{Q)\\Q) 

> inf D{P\\Cl), 
PeV.QeQ 

we obtain 

minD(P||Q)> inf i?(P||Q). 
PeV.QeQ 

Finally we show that we can replace the infima by minima. Let Q*i and be 
such that {Q-, Q^) ^ D{P\\Q^Q^) is minimized (their existence is guaranteed 
by proposition l2.1|) . Let Q* be a corresponding element in Q and P* = P*(Q*). 
Then i:>(P*||Q*) = D{P\\Q*_QX) and the resuh follows. □ 

For a probabilistic derivation of the solutions of the two partial minimization 
problems and of their corresponding Pythagorean rules, we use a general result 
Qemma 13.61 below*) on the Ldivergence between two joint laws of any random 
vector (U,V). We denote the law of {U^V) under arbitrary probability mea- 
sures P and Q by P*-^'^ and Q'^'^. The conditional distributions of U given V 
are summarized by the matrices P^l^ and Q^'^, with the obvious convention 
= P(C7 = j\V = i) and likewise for Q^'^. 

Lemma 3.6 It holds that 

D{F^^^\\Q^'^) ^EpD{F"\^\\Q^\^) + D{P^\\Q^), (16) 

where 

P{U = j\V) 



^(pc/|y iiqC/IV) ^ ^ ^ j\v) log 



Q{u^j\v)- 



If moreover V = (Vi,V2), and U,V2 are conditionally independent given Vi 
under Q, then the first term on the RHS of lilb\) can be written as 

EpL'(P^I^IIQ^I^) =Epi:>(P^I^||P^I^O+IEpi?(P^I^MIQ^'^')- (17) 
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Proof. It follows from elementary manipulations. 



□ 



The first minimization problem can be solved probabilistically as follows. Given 
Q we are to find its best approximation within T . Let Q correspond to the given 
Q and P correspond to the generic P G P. Choosing U = X ,V = Y = (y_, 1 + ) 
in lemma 13.61 and remembering that P^ is determined by P for all P <E V, 
equation now reads 

D{P\\Q) = Epi?(P^'l^||Q^I^) + D{P\\Q), (18) 

where the matrix Q is as in (I14II . The problem is equivalent to the minimization 
of Epi:)(P^I^||Q^I^) w.r.t. P G P, which is attained (with value 0) at P* with 
p*x|Y _ Q-^l"^ and P*^ = P. To derive probabilistically the corresponding 
Pythagorean rule, we apply (|16|l with P* instead of Q. We obtain, using P^ = 

D(F^-y\\P*''-'') =EpD(P^I^||P*'^"'). (19) 
Since also 

EpL'(P^I'*'||Q^I'*') =Epi:>(P-^l^||P*''"''), (20) 

we combine equations (|19ll and H2U|) and insert the result into (|18|l . Recognizing 
the fact that ^(PIIP*) = i:>(P-^'^||P*'''''), and using i:»(P*||Q) = D{P\\Q) 
according to (|13|l . we then identify (|18|l as the first Pythagorean rule (|12|l . 

The treatment of the second minimization problem follows a similar pattern. 
Given P we are to find its best approximation within Q. Let P correspond to 
the given P and Q correspond to the generic Q G Q. Choosing U — Y+, Vi = X 
and V2 — in lemma and remembering that under any Q G Q the r.v. 
YL,1+ are conditionally independent given X, equation (|16|) refined with H17|l 
now reads 

i:>(P||Q) =Ep7:>(p^+i-^'^-||p^+i^) 

+ EpD (P^+ 1 ^ 1 1 1 ^ ) + D (P^- ■ ^ 1 1 Q^- ■ ^ ) . 

The problem is equivalent to the minimizations of the second and third I- 
divergences on the RHS w.r.t. Q G Q, which are attained (both with value 
0) at Q* with Q*^+\^ = P^+l^ and Q*^-^^ = ¥^-'^. Note that X has the 
same distribution under P and Q* . To derive probabilistically the corresponding 
Pythagorean rule we notice that 

D{P\\Q) - D{P\\Q*) ^Eq,D{Q*^+\^\\Q^+\^) + DiQ*''"'' (21) 

In the right hand side of (|21|l we can, by conditional independence, replace 
Eq,D{Q*^+\^\\Q^'+\^) with Eq,D{Q*^+\^'^-\\Q^+\^'^~). By yet another ap- 
plication of we thus see that ^^(PIIQ) - i:'(Pl|Q*) = £'(Q*1|Q), which is 
the second Pythagorean rule (|15|l . 
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4 Alternating minimization algorithm 

The results of the previous section are aimed at setting up an ahernating mini- 
mization algorithm for obtaining ming D{P\ \Q), where P is a given nonnegative 
matrix. In view of proposition 13.21 we can lift this problem to the 7^ x Q space. 
Starting with an arbitrary Q*^ g Q with positive elements, we adopt the follow- 
ing alternating minimization scheme 

-> Q* ^ P* ^ Q*+i ^ p*+i ^ (22) 
where P* = P*(Q*), Q*+i = Q*(P*). 

To relate this algorithm to the one of scction|21(formulas © and we combine 
two steps of the alternating minimization at a time. From H22(l we get 

Q*+i = Q*(P*(Q*)). 

Computing the optimal solutions according to (0, H1U|) and one gets from 
here the formulas ((HJ and Q of section |2] 

The Pythagorean rules allow us to easily compute the update gain D(P||(5') — 
D{P\\Q*+^) of the algorithm. 

Proposition 4.1 The update gain at each iteration of the algorithm \2^) in 
terms of the matrices Q* is given by 

D{P\\Q') - D{P\\Q'+^) = i?(P*||P*+i) + i?(Q*+i||Q*). (23) 

Proof. The two Pythagorean rules from lemma 1531 now take the forms 

D(P*||Q*) = 7^(P*||Q*+i) + i5(Q*+i||Q*), 
D(P*||Q*+i) = i:i(P*||P*+^) i:i(P*+i||Q*+i). 

Addition of these two equations results in 

D(P*||Q*) = L>(P*||P*+i) + i:>(P*+i||Q*+i) + i:)(Q*+i||Q*), 

and since £)(P*||Q*) = D{P\\Q^) from the resuh follows. □ 

Remark 4.2 If one starts the algorithm with matrices (Q° , Q\) in the interior 
of the domain, the iterations will remain in the interior. Suppose that, at step n, 
the update gain is zero. Then, from (|23(l . we get that /^(Q'+^IIQ*) — 0. Hence 
the tensors Q'"'""'^ and Q* are identical. From this it follows by summation that 
Q*_+i = Q*_. But then we also have the equality Q*_{il)Q*^^ilj) = Qt{il)Q+ilj) 
for all i, l,j. Since all Q^_{il) are positive, we also have = Q*+- Hence, the 
updating formulas strictly decrease the objective function until the algorithm 
reaches a fixed point. 
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We close this section with the proof of proposition l2.1l in which we use the result 
of proposition 14. II 

Proof of proposition 12. ll We first prove that there exists a pair of matrices 
iW,H) with He,n = ek and Wek = Ve,, for which Diy\\WH) is finite. Put 
W — jVcncJ and H = ^ry^ SkC^V. Note that indeed Hcm — ek and Wet = 
Vcn and that all elements of W and _ff , and hence those of WH , are positive, 
D{V\\WH) is therefore finite. 

Next we show that we can restrict ourselves to minimization over a compact 
set /C of matrices. Specifically, we will show that for all positive matrices W 
and if, there exist positive matrices W and H' with {W',H') g /C such that 
D{V\\W'H') < D{V\\WH). We choose for arbitrary W° and H° the matrices 
VF^ and according to l|3Jl and (QJ. It follows from proposition 14.11 that in- 
deed D{V\\W'^H^) < D(y\\W'^H°). Moreover, it is immediately clear from Q 
and Q that we have W^e = Ve and H^e = e. Hence, it is sufficient to confine 
search to the compact set C where He = e and We = Ve. 

Fix a pair of indices i, j. Since we can compute the divergence elementwise 
we have the trivial estimate 

D{V\\WH)> V, log j^^^^ - V, + {WH).,. 

Since for Vij > the function dij : x ^ Vij log ^ — Vtj + a; is decreasing 
on {0,Vij), we have for any sufhciently small e > (of course e < Vij) that 
dij{x) > dij{e) for x < e and of course lim^^o f^ij (^) — oo. Hence to find the 
minimum of dij, it is sufficient to look at x > e. Let Eq > and such that 
Eo < min{Fy : Vij > 0}. Let Q be the set of iW,H) such that iWH)ij > Eq 
for all i,j with Vij > 0. Then Q is closed. Take now IC = C O Q, then /C is 
the compact set we are after. Let us observe that /C is non-void for sufficiently 
small Eq. Clearly the map {W,H) is continuous on /C and thus 

attains its minimum. □ 



5 Auxiliary functions 

Algorithms for recursive minimization can often be constructed by using auxil- 
iary functions. For the problem of minimizing the divergence D{V\\WH), some 
such functions can be found in '10^ and they are analogous to functions that are 
used when studying the EM algorithm, see JH] . The choice of an auxiliary func- 
tion is usually based on ad hoc reasoning, like for instance finding a Lyapunov 
function for studying the stability of the solutions of a differential equation. 
We show in this section that the lifted version of the divergence minimization 
problem leads in a natural way to useful auxiliary functions. Let us first explain 
what is meant by an auxiliary function. 

Suppose one wants to minimize a function x i— > F{x), defined on some do- 
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main. The function {x,x') G{x,x') is an auxiliary function for F if 

G{x,x') > F{x'), \/x,x', 
G{x,x)^F{x), Va;. 

If we define (assuming that the argmin below exists and is unique) 

x' = x'(a;) = argminG(a;, •), (24) 

then we have 

F{x') < G{x,x') < G{x,x) = F{x), 

and hence the value of F decreases by replacing x with x' . A recursive procedure 
to find the minimum of F can be based on the recipe l|24|l by taking x = x* 
and x' = a;*"*"^. To be useful an auxiliary function G must allow for a simple 
computation or characterization of argmin G(a::, •). 

We consider now the minimization of D{P\\Q) and its lifted version, the 
minimization of Z3(P||Q) as in section |21 In particular, with reference to the 
alternating minimization scheme 122|) . with the notations of section^ we know 
that Q*"'""'^ is found by minimizing Q' £'(P*(Q*)||Q'). This strongly moti- 
vates the choice of the function 

(Q,Q')^G(Q,Q') = i?(P*(Q)||Q') 

as an auxiliary function for minimizing D(P\\Q) w.r.t. Q. 

Using the decomposition of the divergence in equation we can rewrite 
G as 

G(Q,Q') =i?(P*''||Q"')+Ep.i?(P*''"'||Q'^l^). (25) 

Since P*^l^ = Q^l"^, and P*"*^ = P we can rewrite ^ as 

G(Q, Q') - D{P\\Q"') + Epi?(Q^I^||Q'^l^). (26) 

From ||2nil it follows that G(Q, Q') > D{P\\Q'), and that G(Q, Q) = D{P\\Q), 
precisely the two properties that define an auxiliary function for £'(P||Q). 

In one can find two auxiliary functions for the original minimization problem 
D{V\ \WH). One function is for minimization over H with fixed W, the other for 
minimization over W with fixed H. To show the connection with the function 
G defined above, we first make the dependence of G on Q-,Q+, explicit 
by writing G(Q, Q') as G(g_ , Q+, Q'_, Q^). 

The auxiliary function for minimization with fixed Q_ can then be taken as 
Q'+ ^ G+(QV) = G(0_,Q+,Q_,QV), 
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whereas the auxihary function for minimization with fixed can be taken as 
Q-^Gq(Q'_) = G(Q_,Q+,Q'_,Q+) 

The functions Gq and Gq correspond to the auxihary functions in where 
they are given in an exphcit form, but where no rationale for them is given. 

For the different auxihary functions introduced above, we wiU now compute 
the update gains and compare these expressions with (|23|l . 

Lemma 5.1 Consider the auxiliary functions G ,Gq , Gq above. Denote by 
Q'_ and Q'_y_ the minimizers of the auxiliary functions in all three cases. The 
following equalities hold 

D{P\\Q^Q+) - G^{Q'_) = i5(Q'^-^^||Q^-'^) (27) 

D{P\\Q^Q+) - G+(OV) = Ep.i?(Q'^+l^||Q^+l^) (28) 
D{P\\Q_Q+) - G{Q.,Q+,Q'_,Q'+) = i?(Q'^-'^||Q^-'^) 

+ EQ,i?(Q'^+l^||Q^+l^). (29) 

Proof. We prove H29|l first. The other two follow from this. A simple compu- 
tation, valid for any Q-a. nd yields 

D(P||Q_Q+)-G(Q_,Q+,Q'_,QV) (30) 



Q(u) V Q-W Q+{lj) 

(31) 

Now we exploit the known formulas © and Q for the optimizing Q'_ and Q^. 
The first term in H31|) becomes in view of © (or, equivalently, in view of (jS)) 
and (Cni) 

which gives the first term on the RHS of (|29|l . Similarly, the second term in H31|l 
can be written in view of (jT)) as 

E(EQ'(^^^-))EQVa^-)iog|±||, 

which yields the second term on the RHS of formula (|29l) . Formulas 1)2 7|) and H28|l 
are obtained similarly, noticing that optimization of Gq and Gq separately yield 
the same Q'_^, respectively Q'_, as those obtained by minimization of G. □ 
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Remark 5.2 Notice that although for instance G'^{Q'_) > D{P\\Q'_Q'^) for 
all Q'_ and Q'_^, we have for the optimal Q'_ that Gq(Q'_) < D{P\\Q^Q+). 

Corollary 5.3 The update gain of the algorithm ^ can be represented by 
D{P\\Q') - D{P\ |Q*+i) =i^(Q*+i"- I IQ*"- ) 

+ EpZ?(Q*'"''l|Q*+i'"''). (32) 

Proof. Write 

D{P\\Q') ^ D{P\\Q'+^) ^ 

D{P\\Q') - G(Q*, Q*+i) + G(Q*,Q*+i) - DiP\\Q'+') 

and use equations H25|l and (|29|l . □ 

We return to the update formula H23() . A computation shows the following 
equalities. 

£,(pt||pt+i) ^Ep7:>((Q*'"''||Q*+i'"'') (33) 
D{Q*+'\\Q')=D{Q'+'"-"\\Q*"-") 

+ EQ.+ii^(Q*+i"^'"||Q*"^'"). (34) 



In equation (|33() we recognize the second term in the auxiliary function, see (|26() . 
Equation (|34|) corresponds to equation (|29|l of lemma 15.11 and we see that for- 
mula (|23|l is indeed the same as (|32|) . 

The algorithm (jSJ, (|7I) is to be understood by using these two equations simul- 
taneously. As an alternative one could first use ^ to obtain Q*^^ and, instead 
of using Q*_, feed this result into Q to obtain Q*^^. If we do this, we can ex- 
press the update gain of the first partial step, like in the proof of coroUarv 15.31 
by adding the result of equation (|27|) to the second summand of (|2()|l . with the 
understanding that Q' is now given by the Q*'^^{ij)Q'^{lj)- The update gain 
of the second partial step is likewise obtained by combining the result of H28|l 
and the second summand of (|25|l . with the understanding that now Q is to be 
interpreted as given by the Q'^^^{ij)Q*{lj)- Of course, as another alternative, 
the order of the partial steps can be reversed. Clearly, the expressions for the 
update gains for these cases also result from working with the auxiliary func- 
tions Gq and Gg, the equations (|77|l and and proceeding as in the proof 
of coroUarv 15. 31 



6 Convergence properties 

In this section we study the convergence properties of the divergence minimiza- 
tion algorithm ©, O. 
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The next theorem states that the sequences generated by the algorithm 
converge for every (admissible) initial value. Of course the limits will in general 
depend on the initial value. 

Theorem 6.1 Let Q^_{il), Q%{lj) he generated by the algorithm @), ^ and 
Q* the corresponding tensors. Then the Q^_{il) converge to limits Q°^{il) and 
the Q* converges to a limit Q°° in Q. The Q\{lj) converge to limits Q°^{lj) 
for all I with (iO > 0. 

Proof. We first show that the and Q\ form convergent sequences. We 
start with equation (|23|l . By summing over n we obtain 

t-i 

D{P\\Q^) ~ D{P\\Q') = (^(P1|P'+') + ^?(Q^+^||Q^)). 

k=l 

It follows that Efcli-C'(P1|P''+^) and Efeli -D(Q^+^ | |Q") are finite. Now we 
use that fact that for any two probability measures, the KuUback-Leibler diver- 
gence D(P||Q) is greater than or equal to their Hellinger distance 7J(P, Q), 
which is the distance between the square roots of corresponding densi- 
ties w.r.t. some dominating measure, see [131 p. 368]. In our case we have 
ff(Q«,Q-+i) = Y.a,{V^^^^Wj) - V^FWj)? ■ So we obtain that 

OO 

^i/(Q«+i,Q^)<cx). 
fe=i 

We therefore have that, pointwise, the tensors Q* form a Cauchy sequence 
and hence have a limit Q°°. We will show that Q°° belongs to Q. Since the 
Q'^iilj) converge to limits Q°°{ilj), by summation we have that the marginals 
Q-{il) = Q*{il-) converge to limits Q°°(iZ-) (we use the notation of the proof 
of lemma . and likewise we have convergence of the marginals Q*(-0) to 
Q°°(-0') andQ*(-;-) toQ°°(-?-). Hence, if Q°°(-/-) > 0, then the Q^Zj) converge 
to Q^(ij) := Q°°(-0')/Q°°(-^) and we have Q°°{ilj) = Q°°{il-)Q+ {ij). Now 
we analyze the case where Q°°{-Iq-) = for some Iq. Since in this case both 
Q°°(*^oj) and Q°°{ilo •) are zero, we have still have a factorization Q°°(i?oj) = 
Qf°(i^o)Q+ (^oj), where we can assign to the Q+ (^oj) arbitrary values. Let L 
be the set of / for which J2i Q-{il) > 0. Then Q°°(u) = J2ieL Q^WQ+i^j) 
and the Q* converge to Q°°. This proves the theorem. □ 

Remark 6.2 Theorem 16.11 savs nothing of the convergence of the Q\{li) for 
those I where ^^^^^(^O = 0- -But their behavior is uninteresting from a fac- 
torization point of view. Indeed, since the l-ih column of is zero, the values 
of the Z-th row of Q"^ are not relevant, since they don't appear in the product 
QooQOD ^ matter of fact, we now deal with an approximate nonnegative 
factorization with a lower inner size. See also remark f3.3l 

In the next theorem we characterize the properties of the fixed points of the 
algorithm. Recall from section[5]that the objective function has no local maxima 
in the interior of the domain. 
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Theorem 6.3 // {Q^,Q-^) is a limit point of the algorithm ^ in the 
interior of the domain, then it is a stationary point of the objective function D. 
If Q+) is a limit point on the boundary of the domain corresponding to an 
approximate factorization where none of the columns ofQ- is zero Q~{il) > 
for all I ), then all partial derivatives qq^^h^ o-nd QQ^(^ij) o'^e nonnegative. 

Proof. By computing the first order partial derivatives of the objective func- 
tion, using the middle term of equation |SJ), we can rewrite the update equa- 
tions ®, as 

/ \ 

Q*+\iO-Q-W -TJTT^ + l (35) 



dQ-{il) 



and 



where qq^^^^ij stands for the partial derivative ^^^-^-^^ evaluated at ((5*_,Q+) 
and likewise for a^^/, n . 

Let {Q-,Q+) be a limit point of the algorithm. Equations and ijc 
become 

It follows that we then have the relations 
and 

dD 

dQ+(lj) 

We first consider Suppose that for some i and / we have Q-{il) > 0, then 
necessarily gg^^^;) ~ 0. Suppose now that for some i,l we have Q^{il) = 

and that < 0. Of course, by continuity, this partial derivative will be 

negative in a sufficiently small neighborhood of this limit point. Since we deal 
with a limit point of the algorithm, we must have infinitely often for the iterates 
that Q^^(iZ) < Q*'_{il). From H35|l we then conclude that in these points we 
have QQ^(^ii^ > 0. Clearly, this contradicts our assumption of a negative partial 
derivative, since eventually the iterates will be in the small neighborhood of 
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the limit point, where the partial derivative is positive. Hence, we conclude 
that QQ^(^ii) > 0, if Q-{il) = 0. The proof of the companion statement for the 
Q+{lj) is similar. If Q^{lj) > 0, the corresponding partial derivative is zero. 
Let / be such that Q+{lj) — and suppose that we have that QQ^(^ij^ < 0. If 

we run the algorithm, then g^^^ifj / J2i Q*-^{il) converges to a negative limit, 
whereas X^i ^^(^O/ X^i ^^^(^O converges to one. Hence there is ?7 > such 
that eventually ^^f^ / J2,Q'+\il) < -277/8 and E^QU^l)/ E^Q'-'i^l) > 
1 — r]/3. Hence eventually we would have, see ij^ . 

which contradicts convergence of Q\{lj) to zero. □ 

Remark 6.4 If it happens that a limit point has a zero ^-th column, then it 
can easily be shown that the partial derivatives QQ^{ij) of D are zero. Nothing 
can be said of the values of the partial derivatives gQ^i^^ij for such I. But, 
see also remark 16.21 this case can be reduced to one with a lower inner size 
factorization, for which the assertion of theorem l6.3l is valid. 

Corollary 6.5 The limit points of the algorithm with ^iQ~iil) > for all 
I are all Kuhn- Tucker points for minimization of D under the inequality con- 
straints > and > 0. 

Proof. Consider the Lagrange function L defined by 

L{Q-,Q+) = D{P\\Q^Q+) -X-Q^-^i-Q+, 

where for instance the inner product A • Q_ is to be read as XiiQ-{il) for 
\a e M. Let us focus on a partial derivative qq^^^i) in a fixed point of the 
algorithm. The treatment of the other partial derivatives is similar. From the 
proof of theorem l6.3l we know that in a fixed point we have Q-{il)g^j^jjj = 0. 

Suppose that Q^{il) > 0, then qq^^^i-^ = and the Kuhn- Tucker conditions 
for this variable are satisfied with Xu = 0. If Q-{il) = 0, then we know from 
theorem 16.31 that qq^^^^ > 0. By taking A^; = gq^i^^i^ > 0, we see that also 
here the Kuhn- Tucker conditions are satisfied. □ 

Remark 6.6 Wu |il^ has a number of theorems that characterize the limit 
points of the closely related EM algorithm, or generalized EM algorithm. These 
are all consequence of a general convergence result in Zangwill JS]- The dif- 
ference of our results with his is, that we also have to consider possible limit 
points on the boundary, whereas Wu's results are based on the assumption that 
all limit points lie in the interior of the domain. 
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7 Relation with other minimization problems 



Other data analysis methods proposed in the hterature enforce some form of 
positivity constraint and it is useful to investigate the connection between NMF 
and these methods. An interesting example is the so called Archetypal Analysis 
(A A) technique 0]. Assigned a matrix X E jjmx" ^i^jj ^n integer k, the AA 
problem is to find, in the convex hull of the columns of X, a set of k vectors whose 
convex combinations can optimally represent X. To understand the relation 
between NMF and A A we choose the L2 criterion for both problems. For any 
matrix A and positive definite matrix S define \ \A\\^ — (tr(A"'"EA))^/^. Denote 
1 1^41 1/ = The solution of the NMF problem is then 

(W,H) = a.rgmm\\V ~WH\\ 

WM 

where the minimization is constrained to the proper set of matrices. The solu- 
tion to the AA problem is given by the pair of column stochastic matrices (A, B) 
of respective sizes k x n and m x k such that \\X — XBA\\ is minimized (the 
constraint to column stochastic matrices is imposed by the convexity). Since 
\\X - XBA\ \ = \\I - BA\\xTx the solution of the AA problem is 

(A.B) ^ argminWl - BAWxTx- 

AA and NMF can therefore be viewed as special cases of a more general problem 
which can be stated as follows. Given any matrix P e R™^", any positive 
definite matrix E, and any integer fc, find the best nonnegative factorization 
P ~ Q1Q2 (with Qi e W^'"', Q2 e in the L2 sense, i.e. 

((3i,Q2) = arg min ||F - (5iQ2||e- 

Qi ,Q2 
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